Introduction to Applied Statistical Learning

What is Statistical Learning?

Statistical learning refers to a vast set of tools for understanding data
We can break these tools into two broad categories: Supervised and Unsupervised
We will utilize these tools to build statistical models for predicting or estimating an output based on one or more inputs

Predict whether someone will have a heart attack on the basis of demographic, diet, and clinical measurements

Establish the relationship between salary and demographic variables in population survey data

Outcome measurement: Y
- dependent variable, response, target
Vector of p predictor measurements: X
- inputs, regressors, covariates, features, independent variables
In the regression problem, Y is quantitative (e.g., price, blood pressure, temperature)
In the classification problem, Y takes values in a finite, unordered set (e.g., survived/died, digit 0-9, cancer class of tissue sample)
We have training data \((x_1, y_1), \ldots, (x_N, y_N)\): N observations (instances) of these measurements
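
To make the notation concrete, here is a minimal sketch of a training set with N = 4 observations and p = 2 predictors, used once with a quantitative Y (regression) and once with a categorical Y (classification). All values are made up for illustration, and the one-nearest-neighbor rule is only an assumed stand-in predictor chosen because it works for both kinds of response; it is not a method named in these notes.

```python
import math

# Hypothetical training inputs: each x_i is a vector of p = 2 predictors
# (age in years, smoker flag). The numbers are invented for illustration.
X_train = [
    [63.0, 1.0],
    [45.0, 0.0],
    [58.0, 1.0],
    [39.0, 0.0],
]

# Regression: Y is quantitative (here, a made-up blood pressure reading).
y_regression = [148.0, 121.0, 139.0, 117.0]

# Classification: Y takes values in a finite, unordered set.
y_classification = ["attack", "no_attack", "attack", "no_attack"]


def nearest_neighbor_predict(X, y, x_new):
    """Predict the response for x_new by copying the response of the
    closest training observation (Euclidean distance)."""
    distances = [math.dist(x_i, x_new) for x_i in X]
    closest = distances.index(min(distances))
    return y[closest]


x_new = [60.0, 1.0]  # an unseen case
predicted_pressure = nearest_neighbor_predict(X_train, y_regression, x_new)
predicted_class = nearest_neighbor_predict(X_train, y_classification, x_new)
```

The same inputs X serve both problems; only the type of Y changes, and with it what counts as a sensible prediction.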

On the basis of the training data we would like to:
- Accurately predict unseen test cases
- Understand which inputs (independent variables) affect the outcome (dependent variable) and how
- Assess the quality of our predictions and inferences
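
The three goals above can be sketched on a toy problem: fit a model on training data, read off how the input affects the outcome, and assess prediction quality on held-out cases. The data, the alternating train/test split, and the helper names `fit_simple_linear` and `mean_squared_error` are assumptions made for this illustration, and least squares with a single predictor stands in for whatever method is actually being studied.

```python
def fit_simple_linear(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Made-up data: roughly y = 1 + 2x plus small hand-picked perturbations.
x_all = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1, 0.0, -0.1]
y_all = [1.0 + 2.0 * xi + e for xi, e in zip(x_all, noise)]

# Hold out every other observation as "unseen" test cases.
x_train, y_train = x_all[0::2], y_all[0::2]
x_test, y_test = x_all[1::2], y_all[1::2]

# Inference: the fitted slope estimates how the input affects the outcome.
intercept, slope = fit_simple_linear(x_train, y_train)

# Prediction quality: error on cases the model never saw during fitting.
test_predictions = [intercept + slope * xi for xi in x_test]
test_mse_linear = mean_squared_error(y_test, test_predictions)

# Reference point: a baseline that ignores the input entirely
# and always predicts the training mean.
train_mean = sum(y_train) / len(y_train)
test_mse_baseline = mean_squared_error(y_test, [train_mean] * len(y_test))
```

Comparing `test_mse_linear` with `test_mse_baseline` shows whether the model learned anything useful from the input; measuring error on the training data alone would tend to be overly optimistic.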

It is important to understand the ideas behind the various techniques, in order to know how and when to use them
One has to understand the simpler methods first, in order to grasp the more sophisticated ones
It is important to accurately assess the performance of a method, to know how well or how badly it is working [simpler methods often perform as well as fancier ones!]
This is an exciting research area, having important applications in science, industry and finance
Statistical learning is a fundamental ingredient in the training of a modern data scientist

In unsupervised learning, there is no outcome (dependent) variable
- Just a set of predictors (independent variables, features, etc.) measured on a set of samples
The objective is fuzzier
- Find groups of samples that behave similarly
- Find features (predictors) that behave similarly
- Find linear combinations of features (predictors, independent variables) with the most variation
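
The last objective, finding the linear combination of features with the most variation, can be sketched as follows. The code computes the sample covariance matrix and runs power iteration to approximate its leading eigenvector (the first principal component direction). The function name, the toy data, and the choice of power iteration are assumptions for illustration; this simple version also assumes the starting vector is not orthogonal to the leading direction and that the features are not all constant.

```python
import math


def first_principal_direction(X, n_iter=200):
    """Approximate the unit vector w for which the linear combination
    X @ w has the largest sample variance (power iteration on the
    sample covariance matrix)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    cov = [
        [sum(r[j] * r[k] for r in centered) / (n - 1) for k in range(p)]
        for j in range(p)
    ]
    w = [1.0] * p  # arbitrary nonzero starting vector
    for _ in range(n_iter):
        v = [sum(cov[j][k] * w[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    return w


# Made-up samples whose two features rise and fall together.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]]
w = first_principal_direction(X)
```

For data like this, where the two features move together, the direction `w` weights both features about equally.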
Difficult to know how well you are doing
Different from supervised learning, but can be useful as a pre-processing step for supervised learning
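
As one illustration of finding groups of samples that behave similarly, here is a minimal k-means sketch. k-means is an assumed choice of clustering method, not one named in these notes, and the data and initial centers are made up. The resulting cluster labels could then be passed to a supervised method as an extra feature, which is one way unsupervised learning serves as a pre-processing step.

```python
import math


def k_means(X, initial_centers, n_iter=10):
    """Alternate between assigning each sample to its nearest center
    and moving each center to the mean of its assigned samples."""
    centers = [list(c) for c in initial_centers]
    labels = [0] * len(X)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every sample.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
            for x in X
        ]
        # Update step: recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up samples forming two visibly separated groups.
X = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],
]
labels, centers = k_means(X, initial_centers=[X[0], X[3]])

# Pre-processing idea: append the cluster label to each sample as a new feature.
X_with_cluster = [x + [float(lab)] for x, lab in zip(X, labels)]
```

Note that nothing here measures how good the grouping is. With no outcome variable to compare against, it is hard to know how well the method is doing.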